Respect Retry-After header in OpenAI retry decorator #20813
Open
debu-sinha wants to merge 3 commits into run-llama:main from
Conversation
AstraBert (Member) approved these changes on Mar 2, 2026 and left a comment:
Looks good! As usual, you need to bump the versions of the integrations you modified in order for them to be published.
The retry decorator for both OpenAI LLM and embeddings integrations previously used a fixed exponential backoff for all retryable errors, including RateLimitError. When the server sends a Retry-After header, the client should wait the specified duration instead of guessing with exponential backoff. This adds a custom tenacity wait strategy (_WaitRetryAfter) that extracts the Retry-After header from RateLimitError responses and uses it as the sleep duration, capped at 120 seconds. For all other errors or when the header is missing, it falls back to the existing exponential backoff behavior. Fixes run-llama#15649 Signed-off-by: debu-sinha <debusinha2009@gmail.com>
debu-sinha force-pushed from 77b500b to 571f4e8
Contributor (Author)
Good call, bumped both packages. Also rebased on latest main.
Fixes #15649
Description
The OpenAI retry decorator for both the LLM and embeddings integrations currently uses a fixed exponential backoff for all retryable errors. When the server responds with a 429 status and includes a `Retry-After` header, the client should wait the server-specified duration instead of guessing with exponential backoff.

Without this fix, the retry loop can either wait too long (wasting time when the server says "try again in 1 second" but backoff says "wait 30 seconds") or not long enough (retrying before the rate limit window resets, burning through all retries uselessly).
Changes
New `_WaitRetryAfter` wait strategy (added to both the embeddings and LLM `utils.py`):

- Subclasses tenacity's `wait_base` to integrate cleanly with the existing retry stack
- On `RateLimitError`: extracts `Retry-After` from `response.headers` (`httpx.Headers`, case-insensitive)

New `_parse_retry_after` helper:

No breaking changes: the function signature of `create_retry_decorator()` is unchanged. Existing behavior is preserved for non-`RateLimitError` exceptions and when the `Retry-After` header is absent.

Files Changed

- `llama-index-integrations/embeddings/.../openai/utils.py` (`_WaitRetryAfter`, `_parse_retry_after`, updated `create_retry_decorator`)
- `llama-index-integrations/llms/.../openai/utils.py`
- `llama-index-integrations/embeddings/.../tests/test_retry_after.py`
- `llama-index-integrations/llms/.../tests/test_retry_after.py`

Testing
35 new unit and integration tests covering:
All existing tests pass unchanged.
Context
This is a follow-up to #14801 / PR #20712 (token-bucket rate limiter) which added proactive rate limiting. This PR addresses the reactive side: when a 429 does occur, the client now waits the exact amount of time the server specifies rather than guessing.
Azure OpenAI inherits from OpenAI (`class AzureOpenAI(OpenAI)`), so this fix applies to Azure automatically.